The Banking Industry has experienced massive growth over the years, and with many new high-tech entrants, competition has also increased tremendously.
Customers have an almost limitless number of options for where to put their money, from traditional banks to well-established online startups. Understanding how your customers are likely to behave equips the bank to address the issues that cause them to leave in the first place.
Customer experience (or customer service) is cited as the number one reason why clients move banks, followed by unfavourable fees and other reasons such as a lack of new digital products.
Once customers leave, it is hard to win them back. Post-event analysis can therefore only explain why they left; it cannot prevent the loss. With the help of AI, however, we can use the same data to predict who might leave, and so gain an opportunity to understand why before it happens.
For most companies, the customer acquisition cost (the cost of acquiring a new customer) is higher than the cost of retaining an existing customer. The challenge of a successful churn project is therefore to increase customer loyalty and, consequently, company revenue.
It is therefore necessary to analyze the data of customers who left, to find insights that might help us predict those likely to leave and develop plans to reduce their number.
In trying to gain insight into the data, I create visual representations that aggregate and summarize the following:
1. Creating customer segmentation based on behavior, characteristics and patterns, and addressing the question, "Which customers do we care about? The best vs. the most valuable."
2. Comparing to a control population: understanding one-time customers vs. regular, engaged ones
3. Identifying what makes your churners different
In looking for a solution, I determine the best approach and algorithms by first:
1. Finding relevant features, then
2. Computing a Churn Score
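The notebook does not define a churn score formula explicitly; as a hedged sketch (the 0-100 scale and the `churn_score` helper are my assumptions, not part of the dataset or any library), one simple way to turn a model's predicted churn probabilities into a score is:

```python
import numpy as np

def churn_score(probabilities, scale=100):
    """Map predicted churn probabilities onto a 0..scale integer score (hypothetical scheme)."""
    return np.clip(np.round(probabilities * scale), 0, scale).astype(int)

# Example: probabilities as produced by any classifier's predict_proba
probs = np.array([0.05, 0.42, 0.91])
print(churn_score(probs))
```

Customers could then be ranked by this score so retention efforts target the highest-risk accounts first.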
1 - ID (numeric)
2 - Surname (string)
3 - Age (numeric)
4 - CreditScore (numeric)
5 - Geography (categorical: 'France', 'Germany', 'Spain')
6 - Gender (categorical: 'Male' or 'Female')
7 - Tenure: number of years the member has been with the Bank (numeric)
8 - Balance: amount in the account (numeric)
9 - NumOfProducts: number of products the member has with the Bank (numeric)
10 - HasCrCard: has a credit card (binary: 1 = 'yes', 0 = 'no')
11 - IsActiveMember: is the member still a current account holder (binary: 1 = 'yes', 0 = 'no')
12 - EstimatedSalary: estimated salary (numeric)
13 - Exited (y): has the client exited? (binary: 1 = 'yes', 0 = 'no')
!python -m pip install scikit-learn==0.22.0 --user
# Importing Libraries
from __future__ import print_function
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns #visualization
sns.set()
import matplotlib.pyplot as plt #visualization
%matplotlib inline
import itertools
import warnings
warnings.filterwarnings("ignore")
import os
import io
import plotly.offline as py #visualization
py.init_notebook_mode(connected=True) #visualization
import plotly.graph_objs as go #visualization
import plotly.tools as tls #visualization
import plotly.figure_factory as ff #visualization
# Required to make it work in Google Colab
from plotly import __version__
import cufflinks as cf
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
cf.go_offline()
Setting Notebook Mode to False
init_notebook_mode(connected=False)
# training dataset import
#cust_data = pd.read_csv("C:\\Users\\Welcome\\PROF790\\Bank_Churn\\Churn_Modelling.csv")
cust_data = pd.read_csv("Churn_Modelling.csv")
cust_data.head(10)
#The train dataset is then uploaded from the files and saved for reading and further analysis
from google.colab import files
uploaded = files.upload()
import io
cust_data = pd.read_csv(io.BytesIO(uploaded['Churn_Modelling.csv']))
# Dataset is now stored in a Pandas Dataframe
cust_data.shape
cust_data.columns
# Percentage per category for the target column.
percentage_labels = cust_data['Exited'].value_counts(normalize = True) * 100
percentage_labels
Note: Of the 10,000 bank customer records provided, 20.37% left the Bank.
# Drop the irrelevant columns as shown above
cust_data = cust_data.drop(["RowNumber", "CustomerId", "Surname"], axis = 1)
# Build correlation matrix
corr = cust_data.corr()
corr.style.background_gradient(cmap='PuBu')
The graphical analysis shows all values are small, lying between -0.5 and +0.5. We can therefore say the features are not strongly correlated.
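To back the visual check with numbers, a small helper (hypothetical, not part of the notebook) can list any feature pairs whose absolute correlation exceeds the 0.5 threshold used above; a sketch on made-up data:

```python
import pandas as pd

def strongly_correlated_pairs(corr, threshold=0.5):
    """Return (col_a, col_b, r) for pairs whose |correlation| exceeds the threshold."""
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(corr.iloc[i, j]) > threshold:
                pairs.append((cols[i], cols[j], corr.iloc[i, j]))
    return pairs

# Toy frame: 'a' and 'b' are perfectly correlated, 'c' only weakly
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]})
pairs = strongly_correlated_pairs(df.corr())
print(pairs)
```

Running the same helper on `cust_data.corr()` should return an empty list, confirming the statement above.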
#credit score correlation
corr.sort_values(by=["CreditScore"], ascending=False).iloc[0].sort_values(ascending=False)
#scatterplot
sns.set()
cols_pr = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts','HasCrCard', 'EstimatedSalary']
sns.pairplot(cust_data[cols_pr], height = 2.0)
plt.show()
Credit Score seems to be normally distributed, while
there is a high number of customers with low balances, and
most customers have either one or two bank products.
Some algorithms cannot handle missing data, or may perform poorly when presented with it. We therefore check for missing values and handle them appropriately.
#missing data
total_null = cust_data.isnull().sum().sort_values(ascending=False)
percent_null = (cust_data.isnull().sum()/cust_data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total_null, percent_null], axis=1, keys=['Total', 'Percent'])
missing_data
import missingno as msno
import matplotlib.pyplot as plt
msno.bar(cust_data)
plt.show()
cust_data.describe()
countmale = cust_data[cust_data['Gender'] == 'Male']['Gender'].count()
countfemale = cust_data[cust_data['Gender'] == 'Female']['Gender'].count()
fig, ax = plt.subplots(figsize=(8, 6))
#print(countmale)
#print(countfemale)
# Churn counts per geography
ax = sns.countplot(hue='Exited', y='Geography', data=cust_data, ax=ax)
Note: in the Exited legend, 0 = 'stayed' and 1 = 'left'
from matplotlib import rcParams
# figure size in inches
#rcParams['figure.figsize'] = 11.7,8.27
g = sns.FacetGrid(cust_data,hue = 'Exited', height = 6.27, aspect=9.7/6.27)
(g.map(plt.hist,'Age',edgecolor="w").add_legend())
sns.kdeplot(cust_data.CreditScore[cust_data.Gender=='Male'], label='Men', shade=True)
sns.kdeplot(cust_data.CreditScore[cust_data.Gender=='Female'], label= 'Women', shade=True)
plt.xlabel('Credit Score')
Credit Score data tends to be normally distributed both in men and women
sns.kdeplot(cust_data.Exited[cust_data.Gender=='Male'], label='Men', shade=True)
sns.kdeplot(cust_data.Exited[cust_data.Gender=='Female'], label= 'Women', shade=True)
plt.xlabel('Exited')
# Work on a copy of the cleaned dataset
training_data = cust_data.copy()
training_data.head()
#Separating churn and non churn customers
churn = training_data[training_data["Exited"] == 1]
not_churn = training_data[training_data["Exited"] == 0]
target_col = ["Exited"]
cat_cols = training_data.nunique()[training_data.nunique() < 6].keys().tolist()
cat_cols = [x for x in cat_cols if x not in target_col]
num_cols = [x for x in training_data.columns if x not in cat_cols + target_col]
Setting Display Function for Colab
def configure_plotly_browser_state():
    import IPython
    display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-1.5.1.min.js?noext',
            },
          });
        </script>
    '''))
# Setting to Allow for Colab Display of interactive Plots
configure_plotly_browser_state()
# Function which plots box plots for detecting outliers
trace = []
def gen_boxplot(df):
    for feature in df:
        trace.append(
            go.Box(
                name = feature,
                y = df[feature]
            )
        )
new_df = training_data[num_cols[:1]]
gen_boxplot(new_df)
data = trace
py.iplot(data)
Credit Score data is normally distributed, but with a concentration of low credit scores for a substantial number of customers.
configure_plotly_browser_state()
# Function which plots box plots for detecting outliers
trace = []
def gen_boxplot(df):
    for feature in df:
        trace.append(
            go.Box(
                name = feature,
                y = df[feature]
            )
        )
new_df = training_data[num_cols[1:3]]
gen_boxplot(new_df)
data = trace
py.iplot(data)
While the lower age limit for banking is set by law at a minimum of 18 years, there is no upper age limit, and banks have a good number of customers well past the upper quartile of 63 years.
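The box plot's outlier cut-off follows the standard Tukey fence rule (Q3 + 1.5 × IQR). A minimal sketch of how the fences are computed, using made-up ages rather than the dataset:

```python
import numpy as np

def iqr_fences(values):
    """Tukey fences: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

ages = [18, 25, 31, 38, 44, 52, 60, 92]
low, high = iqr_fences(ages)
outliers = [a for a in ages if a < low or a > high]
print(low, high, outliers)
```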
#function for histogram for customer churn types
def histogram(column):
    trace1 = go.Histogram(x = churn[column],
                          histnorm = "percent",
                          name = "Churn Customers",
                          marker = dict(line = dict(width = .5,
                                                    color = "black")),
                          opacity = .9)
    trace2 = go.Histogram(x = not_churn[column],
                          histnorm = "percent",
                          name = "Non Churn Customers",
                          marker = dict(line = dict(width = .5,
                                                    color = "black")),
                          opacity = .9)
    data = [trace1, trace2]
    layout = go.Layout(dict(title = column + " Distribution in Customer Attrition",
                            plot_bgcolor = "rgb(243,243,243)",
                            paper_bgcolor = "rgb(243,243,243)",
                            xaxis = dict(gridcolor = 'rgb(255, 255, 255)',
                                         title = column,
                                         zerolinewidth = 1,
                                         ticklen = 5,
                                         gridwidth = 2),
                            yaxis = dict(gridcolor = 'rgb(255, 255, 255)',
                                         title = "percent",
                                         zerolinewidth = 1,
                                         ticklen = 5,
                                         gridwidth = 2)))
    fig = go.Figure(data = data, layout = layout)
    py.iplot(fig)
configure_plotly_browser_state()
# Calling the function for plotting the histogram for creditscore column
histogram(num_cols[0])
It is worth noting that a section of customers with very good credit scores also churn.
We can hypothesize that banks compete for these high-value customers, who are therefore likely to receive attractive offers from competing banks trying to win them over.
configure_plotly_browser_state()
# Calling the function for plotting the histogram for creditscore column
# Pass the mouse hover the graph for more information.
histogram(num_cols[1])
From the above graph, bank customers who churn tend to be older than those who do not. Possible explanations:
1. Older Customers tend to look for value adding products from their Banks and if not available they can easily switch Banks.
2. Older Customers are a very busy lot and value their time and therefore poor customer service can easily drive them away.
configure_plotly_browser_state()
# Calling the function for plotting the histogram for balance column
histogram(num_cols[3])
#function for pie plot for customer attrition types
def plot_pie(column):
    trace1 = go.Pie(values = churn[column].value_counts().values.tolist(),
                    labels = churn[column].value_counts().keys().tolist(),
                    hoverinfo = "label+percent+name",
                    domain = dict(x = [0, .48]),
                    name = "Churn Customers",
                    marker = dict(line = dict(width = 2,
                                              color = "rgb(243,243,243)")),
                    hole = .6)
    trace2 = go.Pie(values = not_churn[column].value_counts().values.tolist(),
                    labels = not_churn[column].value_counts().keys().tolist(),
                    hoverinfo = "label+percent+name",
                    marker = dict(line = dict(width = 2,
                                              color = "rgb(243,243,243)")),
                    domain = dict(x = [.52, 1]),
                    hole = .6,
                    name = "Non churn customers")
    layout = go.Layout(dict(title = column + " Distribution in Customer Attrition",
                            plot_bgcolor = "rgb(243,243,243)",
                            paper_bgcolor = "rgb(243,243,243)",
                            annotations = [dict(text = "churn customers",
                                                font = dict(size = 13),
                                                showarrow = False,
                                                x = .15, y = .5),
                                           dict(text = "Non churn customers",
                                                font = dict(size = 13),
                                                showarrow = False,
                                                x = .88, y = .5)]))
    data = [trace1, trace2]
    fig = go.Figure(data = data, layout = layout)
    py.iplot(fig)
configure_plotly_browser_state()
# Calling the function for plotting the pie plot for geography column
plot_pie(cat_cols[0])
Percentage-wise, Spain accounted for 20.3% of churned customers, roughly half the share of Germany and France.
configure_plotly_browser_state()
# Calling the function for plotting the pie plot for gender column
plot_pie(cat_cols[1])
Female customers are the most likely to leave the bank.
cust_data.columns
cust_data.head()
X_columns = cust_data.columns.tolist()[0:10]
y_columns = cust_data.columns.tolist()[-1:]
print(f'All columns: {cust_data.columns.tolist()}')
print()
print(f'X values: {X_columns}')
print()
print(f'y values: {y_columns}')
X = cust_data[X_columns].values # Credit Score through Estimated Salary
y = cust_data[y_columns].values # Exited
print(X[:5])
# Encoding categorical (string based) data. Country: there are 3 options: France, Spain and Germany
# This will convert those strings into scalar values for analysis
print(X[:8,1], '... will now become: ')
from sklearn.preprocessing import LabelEncoder
label_X_country_encoder = LabelEncoder()
X[:,1] = label_X_country_encoder.fit_transform(X[:,1])
print(X[:8,1])
# We will do the same thing for gender. this will be binary in this dataset
print(X[:6,2], '... will now become: ')
from sklearn.preprocessing import LabelEncoder
label_X_gender_encoder = LabelEncoder()
X[:,2] = label_X_gender_encoder.fit_transform(X[:,2])
print(X[:6,2])
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
pipeline = Pipeline(
[('Categorizer', ColumnTransformer(
[ # Gender
("Gender Label encoder", OneHotEncoder(categories='auto', drop='first'), [2]),
# Geography
("Geography One Hot", OneHotEncoder(categories='auto', drop='first'), [1])
], remainder='passthrough', n_jobs=1)),
# Standard Scaler for the classifier
('Normalizer', StandardScaler())
])
X
y
X_data = pipeline.fit_transform(X)
# Splitting the dataset into the Training and Testing set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_data,
y,
test_size = 0.2,
random_state = 0)
print(f'training shapes: {X_train.shape}, {y_train.shape}')
print(f'testing shapes: {X_test.shape}, {y_test.shape}')
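Given that only about 20% of customers churned, it can help to stratify the split so both sets keep that class ratio. A sketch with toy labels (the `stratify` argument is a standard `train_test_split` option, not used in the split above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: ~20% positives, mirroring the churn ratio
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0)

# The positive-class ratio is preserved in both splits
print(y_tr.mean(), y_te.mean())
```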
With the advancement of data-driven machine learning, it is now possible to identify potentially inactive customers who are likely to churn, and to quickly take measurable steps to retain them.
import keras
from keras.models import Sequential
from keras.layers import Dense #to add layers
classifier = Sequential()
#kernel_initializer --> initialize weights according to a uniform distribution
#input_dim is required for the first hidden layer, as it is the starting point --> number of input features
#units --> number of nodes of the hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
#input_dim --> omit it here; the layer infers its input shape from the previous layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
#the output layer: units should be 1, as the output is a binary outcome, and activation should be 'sigmoid'
#If the dependent variable has more than two categories, use activation = 'softmax'
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
#compile the model --> backpropagation -> gradient descent
#optimizer = algorithm to find the optimal set of weights in the ANN
#loss = function to be optimized; if more than two categories, use "categorical_crossentropy"
#metrics = criteria used to evaluate the performance of the model
classifier.compile(optimizer = 'adam', loss = "binary_crossentropy", metrics = ['accuracy'])
#batch_size = the number of observations after which the weights are updated
#batch size and epochs should be tuned through experiments
#epoch = one pass through the whole dataset
classifier.fit(X_train, y_train, batch_size = 10, epochs = 40)
#predicting the results
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5) #to classify each probability into True or False
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print (cm, '\n\n', y_pred[:10, :])
# accuracy computed from the confusion matrix
print('The Test Accuracy is: ', (cm[0][0] + cm[1][1]) / len(y_test) * 100, '%')
from keras.models import Sequential
from keras.layers import Dense, Dropout
# Initializing the ANN
classifier = Sequential()
# This adds the input layer (by specifying input dimension) AND the first hidden layer (units)
classifier.add(Dense(6,
activation = 'relu',
input_shape = (X_train.shape[1], )))
classifier.add(Dropout(rate=0.1))
# Adding the second hidden layer
# Notice that we do not need to specify input dim.
classifier.add(Dense(6, activation = 'relu'))
classifier.add(Dropout(rate=0.1))
# Adding the output layer
# Notice that we do not need to specify input dim.
# we have an output of 1 node, which is the the desired dimensions of our output (stay with the bank or not)
# We use the sigmoid because we want probability outcomes
classifier.add(Dense(1, activation = 'sigmoid'))
classifier.compile(optimizer='adam', loss = 'binary_crossentropy', metrics=['accuracy'])
history = classifier.fit(X_train, y_train, batch_size=32, epochs= 200, validation_split=0.1, verbose=2)
classifier.summary()
Predicting The Test Set Results
plt.plot(np.array(history.history['accuracy']) * 100)
plt.plot(np.array(history.history['val_accuracy']) * 100)
plt.ylabel('accuracy')
plt.xlabel('epochs')
plt.legend(['train', 'validation'])
plt.title('Accuracy over epochs')
plt.show()
y_pred = classifier.predict(X_test)
print(y_pred[:5])
y_pred = (y_pred > 0.5).astype(int)
print(y_pred[:8])
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
print (((cm[0][0]+cm[1][1])*100)/(len(y_test)), '% of testing data was classified correctly')
At 86.5% test accuracy, the neural network's performance is satisfactory.
It is worth noting that the neural network with dropout improves model accuracy significantly.
To determine which input variables (columns / data attributes / features) contribute most to prediction, we use a RandomForestClassifier from the scikit-learn package.
We plot a feature importance graph as shown below.
# One-Hot encoding our categorical attributes
list_cat = ['Geography', 'Gender']
training_data = pd.get_dummies(training_data, columns = list_cat, prefix = list_cat)
# Import the Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
# We perform training on the Random Forest model and generate the importance of the features
X1 = training_data.drop('Exited', axis=1)
y1 = training_data.Exited
features_label = X1.columns
forest = RandomForestClassifier (n_estimators = 10000, random_state = 0, n_jobs = -1)
forest.fit(X1, y1)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
for i in range(X1.shape[1]):
    # index the labels with indices[i] so each name matches its sorted importance
    print("%2d) %-*s %f" % (i + 1, 30, features_label[indices[i]], importances[indices[i]]))
# Visualization of the Feature importances
plt.title('Feature Importances')
plt.bar(range(X1.shape[1]), importances[indices], color = "green", align = "center")
plt.xticks(range(X1.shape[1]), features_label[indices], rotation = 55)
plt.show()
The most important features to consider are Credit Score, Age, Tenure, Balance and the
number of products a customer has with the Bank.
According to the graph above, these features can determine whether a customer churns or not.
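Impurity-based importances from a random forest can be biased toward high-cardinality numeric features, so a useful cross-check is scikit-learn's `permutation_importance` (available from version 0.22, which this notebook installs). A toy sketch, run on synthetic data rather than the churn dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Toy data: only the first feature is informative (shuffle=False keeps it in column 0)
X_toy, y_toy = make_classification(n_samples=300, n_features=4,
                                   n_informative=1, n_redundant=0,
                                   n_clusters_per_class=1, shuffle=False,
                                   random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Score drop when each feature is shuffled; the informative feature should dominate
result = permutation_importance(rf, X_toy, y_toy, n_repeats=5, random_state=0)
print(result.importances_mean)
```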
K-Nearest Neighbor (KNN)
Logistic Regression (LR)
AdaBoost
Gradient Boosting (GB)
RandomForest (RF)
# Import selected models
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
# Scoring function
from sklearn.metrics import roc_auc_score, roc_curve
X1 = training_data.drop('Exited', axis=1)
y1 = training_data.Exited
# Splitting the dataset in training and test set
X1_train, X1_test, y1_train, y1_test = train_test_split(X1,
y1,
test_size = 0.2)
# Initialization of the KNN
knMod = KNeighborsClassifier(n_neighbors = 5,
weights = 'uniform',
algorithm = 'auto', leaf_size = 30, p = 2,
metric = 'minkowski', metric_params = None)
# Fitting the model with training data
knMod.fit(X1_train, y1_train)
# Initialization of the Logistic Regression
lrMod = LogisticRegression(penalty = 'l2', dual = False, tol = 0.0001, C = 1.0,
fit_intercept = True,
intercept_scaling = 1, class_weight = None,
random_state = None, solver = 'liblinear', max_iter = 100,
multi_class = 'ovr', verbose = 2)
# Fitting the model with training data
lrMod.fit(X1_train, y1_train)
# Initialization of the AdaBoost model
adaMod = AdaBoostClassifier(base_estimator = None,
n_estimators = 200,
learning_rate = 1.0)
# Fitting the model with training data
adaMod.fit(X1_train, y1_train)
# Initialization of the GradientBoosting model
gbMod = GradientBoostingClassifier(loss = 'deviance', n_estimators = 200)
# Fitting the model with training data
gbMod.fit(X1_train, y1_train)
# Initialization of the Random Forest model
rfMod = RandomForestClassifier(n_estimators=10, criterion='gini')
# Fitting the model with training data
rfMod.fit(X1_train, y1_train)
Testing the trained models' performance against a validation set. The metrics we use are the mean accuracy score and the ROC-AUC score.
# Compute the model accuracy on the given test data and labels
knn_acc = knMod.score(X1_test, y1_test)
# Return probability estimates for the test data
test_labels = knMod.predict_proba(X1_test)[:,1]
# Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores
knn_roc_auc = roc_auc_score(y1_test, test_labels , average = 'macro', sample_weight = None)
X1_test.head()
test_labels
# Compute the model accuracy on the given test data and labels
lr_acc = lrMod.score(X1_test, y1_test)
# Return probability estimates for the test data
test_labels = lrMod.predict_proba(np.array(X1_test.values))[:,1]
# Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores
lr_roc_auc = roc_auc_score(y1_test, test_labels , average = 'macro', sample_weight = None)
# Compute the model accuracy on the given test data and labels
ada_acc = adaMod.score(X1_test, y1_test)
# Return probability estimates for the test data
test_labels = adaMod.predict_proba(np.array(X1_test.values))[:,1]
# Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores
ada_roc_auc = roc_auc_score(y1_test, test_labels , average = 'macro')
# Compute the model accuracy on the given test data and labels
gb_acc = gbMod.score(X1_test, y1_test)
# Return probability estimates for the test data
test_labels = gbMod.predict_proba(np.array(X1_test.values))[:,1]
# Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores
gb_roc_auc = roc_auc_score(y1_test, test_labels , average = 'macro')
# Compute the model accuracy on the given test data and labels
rf_acc = rfMod.score(X1_test, y1_test)
# Return probability estimates for the test data
test_labels = rfMod.predict_proba(np.array(X1_test.values))[:,1]
# Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores
rf_roc_auc = roc_auc_score(y1_test, test_labels , average = 'macro')
models = ['KNN', 'Logistic Regression', 'AdaBoost', 'GradientBoosting', 'Random Forest']
accuracy = [knn_acc, lr_acc, ada_acc, gb_acc, rf_acc]
roc_auc = [knn_roc_auc, lr_roc_auc, ada_roc_auc, gb_roc_auc, rf_roc_auc]
d = {'accuracy': accuracy, 'roc_auc': roc_auc}
df_metrics = pd.DataFrame(d, index = models)
df_metrics
fpr_knn, tpr_knn, _ = roc_curve(y1_test, knMod.predict_proba(np.array(X1_test.values))[:,1])
fpr_lr, tpr_lr, _ = roc_curve(y1_test, lrMod.predict_proba(np.array(X1_test.values))[:,1])
fpr_ada, tpr_ada, _ = roc_curve(y1_test, adaMod.predict_proba(np.array(X1_test.values))[:,1])
fpr_gb, tpr_gb, _ = roc_curve(y1_test, gbMod.predict_proba(np.array(X1_test.values))[:,1])
fpr_rf, tpr_rf, _ = roc_curve(y1_test, rfMod.predict_proba(np.array(X1_test.values))[:,1])
# Plot the roc curve
plt.figure(figsize = (12,6), linewidth= 1)
plt.plot(fpr_knn, tpr_knn, label = 'KNN Score: ' + str(round(knn_roc_auc, 5)))
plt.plot(fpr_lr, tpr_lr, label = 'LR score: ' + str(round(lr_roc_auc, 5)))
plt.plot(fpr_ada, tpr_ada, label = 'AdaBoost Score: ' + str(round(ada_roc_auc, 5)))
plt.plot(fpr_gb, tpr_gb, label = 'GB Score: ' + str(round(gb_roc_auc, 5)))
plt.plot(fpr_rf, tpr_rf, label = 'RF score: ' + str(round(rf_roc_auc, 5)))
plt.plot([0,1], [0,1], 'k--', label = 'Random guessing: 0.5')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve ')
plt.legend(loc='best')
plt.show()
The ROC-AUC score is more informative for us: scikit-learn's score() function gives the mean accuracy, which considers only one threshold value, whereas the ROC-AUC score takes all possible threshold values into account.
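The difference can be illustrated on a toy example: accuracy shifts with the chosen cut-off, while ROC-AUC summarizes ranking quality across all thresholds at once.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.8])

# Accuracy changes with the chosen cut-off...
for t in (0.5, 0.65):
    print(t, accuracy_score(y_true, scores > t))

# ...while ROC-AUC is threshold-free; here the positives outrank every negative
print(roc_auc_score(y_true, scores))
```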
Note:
The GradientBoosting (0.86) and AdaBoost (0.83) classifiers show high ROC-AUC scores on the validation dataset.
The other classifiers, such as logistic regression, KNN and RandomForest, do not perform as well on the validation set.
Therefore we shall fine-tune the GradientBoosting and AdaBoost classifiers in order to improve their scores.
We perform some parameter optimization in the following steps.
In this section, I will use the following techniques in order to improve the accuracy of the classifiers :
With cross-validation, instead of splitting the valuable training data into a separate training and validation set, I use KFold cross-validation so that every observation serves for both training and validation.
The models we shall tune are:
- AdaBoost and
- Gradient Boosting Machine
AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. It is also sensitive to noisy data and outliers.
GBM is an ensemble method that works by training many individual learners, almost always decision trees, sequentially, with each tree learning from the mistakes of the previous ones, unlike in a random forest, where the trees are trained in parallel.
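This sequential error-correction can be observed directly through `staged_predict_proba`, which exposes the ensemble's predictions after each successive tree; a toy sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

X_toy, y_toy = make_classification(n_samples=300, random_state=0)
gbm = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Training loss of the partial ensemble after each added tree: it should shrink
losses = [log_loss(y_toy, p) for p in gbm.staged_predict_proba(X_toy)]
print(losses[0], '->', losses[-1])
```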
Since a GBM model has many hyperparameters with complex interactions between them, the only way to find the optimal hyperparameter values is to try many different combinations on a dataset. A range of hyperparameters that control both the overall ensemble (such as the learning rate) and the individual decision trees (such as the number of leaves in the tree or the maximum depth of the tree) is determined through a random search.
The performance of each set of hyperparameters is determined by Receiver Operating Characteristic Area Under the Curve (ROC AUC) from the cross-validation
Here, we can use the Scikit-Learn functions RandomizedSearchCV or GridSearchCV.
To implement K-folds cross-validation, we use the K value = 5
# Import the cross-validation module
from sklearn.model_selection import cross_val_score
# Function that will track the mean value and the standard deviation of the accuracy
def cvDictGen(functions, scr, X_cv = X1, y_cv = y1, cv = 5):
    cvDict = {}
    for func in functions:
        cvScore = cross_val_score(func, X_cv, y_cv, cv = cv, scoring = scr)
        cvDict[str(func).split('(')[0]] = [cvScore.mean(), cvScore.std()]
    return cvDict
mod = [knMod, lrMod, adaMod, gbMod, rfMod]
cvD = cvDictGen(mod, scr = 'roc_auc')
cvD
Based on the mean value and the standard deviation value, we can conclude that our ROC-AUC score does not deviate much, so we are not suffering from the overfitting issue.
I use the RandomizedSearchCV and GridSearchCV hyperparameter tuning methods.
Stacking Random and Grid Search
For a smarter implementation of hyperparameter tuning, I combine random search and grid search as follows:
1. Use random search with a large hyperparameter grid
2. Use the results of random search to build a focused hyperparameter grid around the best performing hyperparameter values.
3. Run grid search on the reduced hyperparameter grid.
4. Repeat grid search on more focused grids until maximum computational/time budget is exceeded.
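The steps above can be sketched end to end on toy data (the wide and narrowed grids here are deliberately tiny, hypothetical choices; the real searches below run on the models trained earlier):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

X_toy, y_toy = make_classification(n_samples=200, random_state=0)

# Step 1: coarse random search over a wide grid
wide = {'n_estimators': [10, 50, 100, 200]}
rand = RandomizedSearchCV(AdaBoostClassifier(random_state=0), wide,
                          n_iter=4, scoring='roc_auc', cv=3, random_state=0)
rand.fit(X_toy, y_toy)
best = rand.best_params_['n_estimators']

# Steps 2-3: focused grid search around the random-search winner
narrow = {'n_estimators': [max(10, best - 25), best, best + 25]}
grid = GridSearchCV(AdaBoostClassifier(random_state=0), narrow,
                    scoring='roc_auc', cv=3)
grid.fit(X_toy, y_toy)
print(grid.best_params_)
```

Step 4 simply repeats the narrowing until the time budget runs out.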
The following steps help in obtaining optimal values for the parameters.
# Import methods
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
# Possible parameters
adaHyperParams = {'n_estimators': [10,50,100,200,420]}
gridSearchAda = RandomizedSearchCV(estimator = adaMod, param_distributions = adaHyperParams, n_iter = 5,
scoring = 'roc_auc')
gridSearchAda.fit(X1_train, y1_train)
# Display the best parameters and the score
gridSearchAda.best_params_, gridSearchAda.best_score_
After hyperparameter tuning, AdaBoost's optimal score is 84.7%, up from the previous 83.4%.
# Possibles parameters
gbHyperParams = {'loss' : ['deviance', 'exponential'],
'n_estimators': randint(10, 500),
'max_depth': randint(1,10)}
# Initialization
gridSearchGB = RandomizedSearchCV(estimator = gbMod, param_distributions = gbHyperParams, n_iter = 10,
scoring = 'roc_auc')
# Fitting the model
gridSearchGB.fit(X1_train, y1_train)
gridSearchGB.best_params_, gridSearchGB.best_score_
After hyperparameter tuning, GradientBoosting's optimal score has remained the same at 86.2%.
Here we are going to use the optimal parameter values that we got from the hyperparameter tuning.
# GradientBoosting with the optimal parameters
bestGbModFitted = gridSearchGB.best_estimator_.fit(X1_train, y1_train)
# AdaBoost with the optimal parameter
bestAdaModFitted = gridSearchAda.best_estimator_.fit(X1_train, y1_train)
functions = [bestGbModFitted, bestAdaModFitted]
cvDictbestpara = cvDictGen(functions, scr = 'roc_auc')
cvDictbestpara
# Getting the score GradientBoosting
test_labels = bestGbModFitted.predict_proba(np.array(X1_test.values))[:,1]
roc_auc_score(y1_test,test_labels , average = 'macro', sample_weight = None)
# Getting the score AdaBoost
test_labels = bestAdaModFitted.predict_proba(np.array(X1_test.values))[:,1]
roc_auc_score(y1_test,test_labels , average = 'macro', sample_weight = None)
Since hyperparameter tuning has not improved the models much, we try transforming the features. We then use an ensemble technique (a voting mechanism) to generate the final prediction probabilities on the actual test dataset, so that we can get the best possible score.
We will apply a standard scaler / log transformation to our training dataset. The reason is that some attributes are very skewed, while others have values that are widely spread out.
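The effect of such a transformation can be illustrated on a toy right-skewed column (e.g. balance-like values) before applying it to the real data: `log1p` compresses the long tail, and the scaler then centers the result.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# A right-skewed toy column spanning several orders of magnitude
skewed = np.array([[0.0], [10.0], [100.0], [1000.0], [10000.0]])

log_tf = FunctionTransformer(np.log1p)          # log(1 + x), safe at zero
scaled = StandardScaler().fit_transform(log_tf.transform(skewed))

print(scaled.ravel())  # roughly evenly spaced, zero-mean values
```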
# Import the log transformation method
from sklearn.preprocessing import FunctionTransformer, StandardScaler
transformer = FunctionTransformer(np.log1p)
scaler = StandardScaler()
X_train_1 = np.array(X1_train)
#X_train_transform = transformer.transform(X_train_1)
X_train_transform = scaler.fit_transform(X_train_1)
bestGbModFitted_transformed = gridSearchGB.best_estimator_.fit(X_train_transform, y1_train)
bestAdaModFitted_transformed = gridSearchAda.best_estimator_.fit(X_train_transform, y1_train)
cvDictbestpara_transform = cvDictGen(functions = [bestGbModFitted_transformed, bestAdaModFitted_transformed],
scr='roc_auc')
cvDictbestpara_transform
# For the test set
X_test_1 = np.array(X1_test)
#X_test_transform = transformer.transform(X_test_1)
# Use transform, not fit_transform: the scaler must remain fitted on the training set only
X_test_transform = scaler.transform(X_test_1)
test_labels=bestGbModFitted_transformed.predict_proba(np.array(X_test_transform))[:,1]
roc_auc_score(y1_test,test_labels , average = 'macro', sample_weight = None)
In this section, we will use a voting-based ensemble classifier. So, we implement a voting-based machine learning model for both untransformed features as well as transformed features. Let's see which version scores better on the validation dataset.
# Import the voting-based ensemble model
from sklearn.ensemble import VotingClassifier
# Initialization of the model
votingMod = VotingClassifier(estimators=[('gb', bestGbModFitted_transformed),
('ada', bestAdaModFitted_transformed)],
voting = 'soft', weights = [2,1])
# Fitting the model
votingMod = votingMod.fit(X_train_transform, y1_train)
test_labels=votingMod.predict_proba(np.array(X_test_transform))[:,1]
votingMod.score(X_test_transform, y1_test)
# The roc_auc score
roc_auc_score(y1_test, test_labels , average = 'macro', sample_weight = None)
# Initialization of the model
votingMod_old = VotingClassifier(estimators = [('gb', bestGbModFitted), ('ada', bestAdaModFitted)],
voting = 'soft', weights = [2,1])
# Fitting the model
votingMod_old = votingMod_old.fit(X1_train, y1_train)
test_labels = votingMod_old.predict_proba(np.array(X1_test.values))[:, 1]
votingMod_old.score(X1_test, y1_test)
# The roc_auc score
roc_auc_score(y1_test, test_labels, average = 'macro', sample_weight = None)
Industry benchmarks put the best generalized churn-prediction performance in the banking sector at about 87%, which compares well with our achieved accuracy score of 86.7% and ROC AUC of 87%.
The Artificial Neural Network's performance was impressive at 86.7% accuracy, comparable to the AdaBoost and GradientBoosting classifiers, both at 85%.
Not included in this report.
1. Female customers are the most likely to churn,
2. Customers located in Germany churn the most, and
3. Customers using only one product churn the most.
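Segment-level churn rates like these can be computed by grouping on a segment column and averaging the binary churn flag. A minimal sketch on a hypothetical sample — the column names follow the dataset fields discussed earlier, but the rows here are made up for illustration:

```python
import pandas as pd

# Hypothetical sample mimicking the churn dataset's columns
df = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Female', 'Male', 'Female'],
    'Geography': ['Germany', 'France', 'Germany', 'Spain', 'France'],
    'NumOfProducts': [1, 2, 1, 1, 2],
    'Exited': [1, 0, 1, 0, 1],   # 1 = churned
})

# Churn rate per segment: the mean of the binary Exited flag
churn_by_gender = df.groupby('Gender')['Exited'].mean()
churn_by_geo = df.groupby('Geography')['Exited'].mean()
print(churn_by_gender)
print(churn_by_geo)
```

The same pattern extends to `NumOfProducts` or any other segmentation column, and the resulting series plug directly into a bar chart.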
After building several models, I ended up with three very promising ones namely:
Keras Neural Network with Dropout,
GradientBoosting and
AdaBoost
which performed better than KNN and Random Forest.
Comparing these allows us to choose the best model.
I compared several algorithms for the task: Random Forest, KNN, AdaBoost, GradientBoosting and neural networks. The accuracy of both AdaBoost and GradientBoosting is comparable to the neural network's, so it is hard to tell which is better.
I decided to tune the hyperparameters of AdaBoost and GradientBoosting and combine them with a voting-based approach, as their initial performance was low but still better than KNN and Random Forest.
A neural network with dropout has proven to be a great algorithm when the dataset is well prepared and clean. AdaBoost and GradientBoosting required hyperparameter tuning but less preprocessing, and their training process is also much simpler.
This suggests that with more hyperparameter tuning, neural networks would yield higher performance than both.
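The hyperparameter-tuning step for the two boosting models can be sketched with scikit-learn's GridSearchCV. This is a minimal illustration on synthetic data; the parameter grids below are assumptions for the example, not the grids actually used in this project:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the churn features/labels
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.8, 0.2], random_state=0)

# Illustrative grids only; a real search would cover more values
gb_grid = {'n_estimators': [50, 100], 'learning_rate': [0.05, 0.1]}
ada_grid = {'n_estimators': [50, 100], 'learning_rate': [0.5, 1.0]}

# Cross-validated search optimizing ROC-AUC, matching the metric used above
gridSearchGB = GridSearchCV(GradientBoostingClassifier(random_state=0),
                            gb_grid, scoring='roc_auc', cv=3)
gridSearchAda = GridSearchCV(AdaBoostClassifier(random_state=0),
                             ada_grid, scoring='roc_auc', cv=3)
gridSearchGB.fit(X, y)
gridSearchAda.fit(X, y)
print(gridSearchGB.best_params_, gridSearchGB.best_score_)
```

After fitting, `best_estimator_` exposes the tuned model, which is how the earlier cells obtain `gridSearchGB.best_estimator_` and `gridSearchAda.best_estimator_`.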
Since the problem is binary classification on an imbalanced dataset, we used the most appropriate metric for model performance, the ROC-AUC score, and the model achieved about 87%.
This score compares very well with industry standards for banking churn prediction. The model could achieve better performance given more historical data for the training phase.
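To illustrate why ROC-AUC suits imbalanced data: it measures how well the predicted probabilities rank positives above negatives, independent of any decision threshold, so a model cannot score well just by predicting the majority class. A tiny sketch:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Imbalanced toy example: four negatives, one positive
y_true = np.array([0, 0, 0, 0, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.4, 0.9])  # positive ranked highest

# Perfect ranking of the positive above every negative gives AUC = 1.0,
# whereas plain accuracy would already be 80% for always predicting 0
print(roc_auc_score(y_true, y_prob))  # → 1.0
```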
In this project, I demonstrated how a business can predict which customers are likely to exit. With this information, a customer retention plan can be more effective and less costly, as it is targeted at a specific group.
Although churn prediction models do a great job of predicting which customers may churn, single models have limitations in solving the problem at hand:
1. No clarity about customer value: a churn prediction model doesn't tell you which of the identified at-risk customers is more valuable, so retention agents end up giving costly offers to low-value customers.
2. No insight into the context of risk: not knowing why a particular customer would want to cancel in the first place makes it harder to know what to do to retain them.
3. No timely and proactive engagement: models don't indicate when a predicted high-risk customer will cancel, so marketers can't prioritize who to target first.
4. The lost opportunity of customer winback: winning back lost customers is more profitable than new customer acquisition, yet the single-model approach only predicts the risk status of active customers and ignores the winback chances of recently canceled ones.
Since multiple factors drive a customer's desire to cancel, churn analytics should go beyond identifying who is most likely to cancel to understanding finer and subtler details: why customers will cancel, when they will cancel, how valuable they are, what can be done to save them, which offers might work for them, and which canceled customers can be won back.
I therefore recommend a multimodel approach, with each model predicting a different dimension of customer behavior, to target the various aspects of the problem.
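One way such a multimodel approach could combine its outputs is to weight churn risk by a separately modeled customer value, so retention effort goes to valuable at-risk customers first. The sketch below is purely illustrative: the `Customer` fields, the numbers, and the priority formula are assumptions, not part of this project:

```python
from dataclasses import dataclass

@dataclass
class Customer:
    customer_id: int
    churn_prob: float   # output of the churn-risk model
    value_score: float  # output of a separate customer-value model

def retention_priority(c: Customer) -> float:
    # Expected value at risk: probability of churning times what would be lost
    return c.churn_prob * c.value_score

customers = [
    Customer(1, 0.9, 100.0),  # high risk, high value -> top priority
    Customer(2, 0.9, 10.0),   # high risk, low value
    Customer(3, 0.1, 500.0),  # low risk, high value
]
ranked = sorted(customers, key=retention_priority, reverse=True)
print([c.customer_id for c in ranked])  # → [1, 3, 2]
```

Further models for churn timing and winback likelihood could feed into the same prioritization in the same way.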
%reset -f